Master node disconnect #132023
Conversation
Extends the `Coordinator.handleJoinRequest` `onFailure` method in the callback to only fail if the node is not already in the `ClusterState`. This avoids a rare corner case where a master node's `ClusterState` already contains a node that is attempting to join the cluster, and an incorrect error message is logged.
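For context, a minimal sketch of that guard, assuming the callback has the join request, the join listener, the connection releasable and the cluster service in scope (variable names here are illustrative, not necessarily the exact ones in the PR):

    @Override
    public void onFailure(Exception e) {
        // Only surface the failure if the joining node is absent from the applied
        // cluster state; if it is already present, the join effectively succeeded
        // via another publication and the WARN log would be misleading.
        if (clusterService.state().nodes().nodeExists(joinRequest.getSourceNode().getId()) == false) {
            Releasables.close(response);
            joinListener.onFailure(e);
        }
    }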
force-pushed from fbc0026 to 962b3b2
Adds two new integration test suites, `NodeJoiningIT` and `NodeJoiningMasterElectionIT`. These include tests related to the coordinator logic when a node joins the cluster. There is also a proposed solution in the Coordinator class, currently commented out. This will be uncommented in a follow-up commit
force-pushed from dcfb910 to 3e501b4
    try {
        ensureSufficientMasterEligibleNodes();
        DiscoveryNode masterNode = internalCluster().clusterService().state().nodes().getMasterNode();
Regarding test flakiness, one contributor could be where you get the info that you use in your assertions. Using internalCluster().clusterService() doesn't necessarily give you the most up-to-date value, since it might not fetch the instance from the current master node. This is documented on that method, but I want to make sure you have considered this random behaviour that this utility can have. You can fetch the master node's instance by providing the master node name, or, similar to TransportMasterNodeActionIT (which I assume this is partially based on), use a master client.
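A sketch of the master-pinned variant, assuming the usual internal-cluster test utilities (getMasterName and getInstance on the internal test cluster); other names are as in the snippet above:

    // Resolve the current master's name and ask *its* ClusterService for state,
    // rather than a randomly chosen node's instance.
    String masterName = internalCluster().getMasterName();
    ClusterService masterClusterService = internalCluster().getInstance(ClusterService.class, masterName);
    DiscoveryNode masterNode = masterClusterService.state().nodes().getMasterNode();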
Having tried that out briefly, I think the test failure is no longer due to the latch you mentioned but rather to the log expectation assertions. Hope that helps!
(at least based on 20-something runs that used to get 8-10 failures in the latch check, which now only fails once later at the log expectation assertion)
test that the warn log is not present
force-pushed from d967904 to dac102a
force-pushed from 89f06d5 to d269e47
Unclear if this is ready for review or not - it's still marked as a draft but you've requested reviews.
I note that you're using force-push rather than merge:

Please don't do that once other folks have started to look at a PR unless absolutely unavoidable. It makes it much harder to track the history of changes against the comments.
     * @param cleanupTasks The list of cleanup tasks
     * @return A latch that will be released when the old master acknowledges the new master's election
     */
    protected CountDownLatch configureElectionLatchForNewMaster(String newMaster, List<Releasable> cleanupTasks) {
I think this'd be much simpler implemented with ClusterServiceUtils.addTemporaryStateListener. Not sure why we didn't do so when first written - I think it must have evolved to this from something more complex.
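A rough sketch of what that could look like here, assuming addTemporaryStateListener registers a listener and returns a SubscribableListener<Void> that completes once the predicate first matches (newMaster is the node name passed into the helper above):

    ClusterService clusterService = internalCluster().getInstance(ClusterService.class, newMaster);
    // Completes when `newMaster` is observed as the elected master; no CountDownLatch needed.
    safeAwait(ClusterServiceUtils.addTemporaryStateListener(clusterService, state -> {
        DiscoveryNode master = state.nodes().getMasterNode();
        return master != null && newMaster.equals(master.getName());
    }));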
Good catch, have updated in a668ab1
     * @param cleanupTasks The list of clean up tasks
     * @return A cyclic barrier which when awaited on will un-block the applier
     */
    protected static CyclicBarrier blockClusterStateApplier(String nodeName, ArrayList<Releasable> cleanupTasks) {
Hmm I don't think you need to pull this one up to the base class, it only has one caller.
Updated in a668ab1
     * @param cleanupTasks The list of cleanup tasks
     * @return A latch that will be released when the master acknowledges its re-election
     */
    protected CountDownLatch configureElectionLatchForReElectedMaster(
Do you need a separate method here? In the new tests I think we can just wait for the term to increase, as observed by any node - no need to worry about which node is master or anything so fiddly.
Also this one is only called from one place. I think it's a premature optimization to generalize these two test suites like this when they have so little in common.
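For example, a sketch of the term-based wait (assuming ClusterState#term() and the same addTemporaryStateListener utility as above):

    long originalTerm = clusterService.state().term();
    // Any node's ClusterService will do: we only care that the term went up.
    safeAwait(ClusterServiceUtils.addTemporaryStateListener(clusterService, state -> state.term() > originalTerm));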
Updated in a668ab1
     * master accepts its own failed state update before standing down, we can still
     * establish a quorum without its (or our own) join.
     */
    protected static String ensureSufficientMasterEligibleNodes() {
The new tests don't need 5 master nodes, and indeed having 5 (rather than 3) makes the situation unnecessarily complex.
Updated in a668ab1
    1. (T, V+1) is accepted -> NodeConnectionsService now stores an open connection to N. It can be closed.
    2. (T, V+1) is rejected -> A new cluster state is published without N in it.
       It is right to close the connection and retry.
    3. The above scenario occurs. We do not close the connection after (T, V+1) fails and keep it open:
       3.1 (T+1, V+2) is accepted -> By waiting, we did not close the connection to N unnecessarily
       3.2 (T+1, V+2) is rejected -> A new cluster state is published without N in it. Closing is correct here.
Nice, I think that makes sense and is much simpler than I anticipated.
    public void clusterChanged(ClusterChangedEvent event) {
        // Now it's safe to close the connection
        Releasables.close(response);
        // Remove this listener to avoid memory leaks
        clusterService.removeListener(this);
This doesn't quite work though, because we apply a cluster state when the master fails too. The state we apply in that case differs from the last committed state only in the values of things which are local properties of the state, i.e. which do not have strong consistency guarantees, such as DiscoveryNodes#masterNodeId. I think we need to keep the connection open until we've applied the next committed state (i.e. one with a non-null master).
I did see the new tests fail in this way when run in a loop, but unfortunately I didn't have things configured to collect a build scan. I'm trying again.
Also we should only insert this listener when the update fails with a FailedToCommitClusterStateException - for all other outcomes there's no need to wait for another state to be committed; indeed we cannot expect another state to be committed any time soon if the cluster is stable.
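A sketch of that shape (not the final implementation; `response` and the surrounding names are as in the snippets above):

    if (e instanceof FailedToCommitClusterStateException) {
        // Wait for the next *committed* state (non-null master) before releasing the connection.
        clusterService.addListener(new ClusterStateListener() {
            @Override
            public void clusterChanged(ClusterChangedEvent event) {
                if (event.state().nodes().getMasterNodeId() != null) {
                    Releasables.close(response);
                    clusterService.removeListener(this);
                }
            }
        });
    } else {
        // All other failures: no further state is expected soon, so close immediately.
        Releasables.close(response);
    }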
This makes sense. I've changed this logic to only close the connection when a committed state is seen, but how would I know whether a FailedToCommitClusterStateException is thrown?
Rather than using ActionListener#runBefore you'd typically capture the exception with ActionListener#delegateResponse. But in this case we're already capturing the success response with ActionListener#delegateFailure, so it's probably best to just create a completely new ActionListener that does the right thing with both success and exceptional responses.
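Roughly the shape being suggested (a sketch only; deferCloseUntilNextCommittedState is a hypothetical helper standing in for the listener-registration logic discussed above):

    ActionListener<Void> joinAckListener = new ActionListener<>() {
        @Override
        public void onResponse(Void unused) {
            // Join accepted: NodeConnectionsService now holds a connection, so ours can be released.
            Releasables.close(response);
            joinListener.onResponse(null);
        }

        @Override
        public void onFailure(Exception e) {
            if (e instanceof FailedToCommitClusterStateException) {
                deferCloseUntilNextCommittedState(response); // hypothetical helper
            } else {
                Releasables.close(response);
            }
            joinListener.onFailure(e);
        }
    };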
Updated in a668ab1
    // exposed for tests
    boolean missingJoinVoteFrom(DiscoveryNode node) {
        logger.info("Missing vote from: {}", node.getName());
Stray logging leftover from debugging? I don't think we want an INFO log here, it will raise support cases asking what it means.
Updated in a668ab1
     * @param name of the node which existence should be verified
     * @return <code>true</code> if the node exists. Otherwise <code>false</code>
     */
    public boolean nodeExistsWithName(String name) {
This is only used in tests, I don't think there's a need to have it in the production codebase.
Updated in a668ab1
    private String generateNodeDescriptionForNewDiscoveryNode(int numberOfNodesOriginallyInCluster, DiscoveryNode masterNode) {
        // Nodes are named `node_s0`, `node_s1` etc ...
        // Therefore, if there are N nodes in the cluster, named `node_s0` ... `node_sN-1`, N+1 will be named `node_sN`
        String newNodeName = "node_s" + numberOfNodesOriginallyInCluster;
This isn't correct, sometimes the new node will be of the form node_tN. There may be other possibilities too, even today, I'm not sure, and we certainly don't know what'll happen in future. More generally, this is heavily overspecifying the conditions we actually want, and coupling this test to the implementation details of how nodes are named in the cluster. If we changed the format of these messages at all, we'd have to update this code too.
I think we should be asserting that we expect to see no messages at all from NodeJoinExecutor at level WARN. The content isn't really relevant.
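A sketch of that assertion style, assuming the MockLog test utility and its UnseenEventExpectation (the wrapped action here is illustrative, not the one in the test):

    MockLog.assertThatLogger(
        () -> internalCluster().startDataOnlyNode(),
        NodeJoinExecutor.class,
        new MockLog.UnseenEventExpectation(
            "no WARN logs from NodeJoinExecutor",
            NodeJoinExecutor.class.getCanonicalName(),
            Level.WARN,
            "*"
        )
    );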
Good point, I have updated in cc67155
- Removes MasterElectionTestCase
- Reverts TransportMasterNodeActionIT
- Modifies NodeJoiningIT to use ClusterStateUtils rather than CountDownLatches
- Modifies the Coordinator to keep the connection open only when there is a FailedToCommitClusterStateException and until the next committed cluster state update
…asticsearch into master-node-disconnect
Thank you David for your help on this! Appreciate there's been a bit of churn.
    Releasables.close(response);
    joinListener.onFailure(e);
    // Immediate condition check in case another node is elected master
    if (clusterService.state().nodes().nodeExists(joinRequest.getSourceNode().getId())) {
We also need to fail the join listener and remove the cluster-state listener if the last-applied state is committed (i.e. has a master node) and doesn't include the joining node.
Does it work to call clusterStateListener.clusterChanged(new ClusterChangedEvent("", clusterService.state(), clusterService.state()))?
"Does it work"

Hmm no, not quite: that might complete `ll` twice, which is generally something we should try and avoid. It turns out that as things are implemented today `ll` will be a SubscribableListener, which has well-defined semantics when completed multiple times:

elasticsearch/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java, lines 44 to 45 in cdc7474:

     * If this listener is completed more than once then all results other than the first (whether successful or otherwise) are silently
     * discarded. All subscribed listeners will be notified of the same result, exactly once, even if several completions occur concurrently.

However we can't rely on that being true in future: it's not guaranteed that `ll` will always be a SubscribableListener in this context. I'd prefer we explicitly deduplicated this work, e.g. by creating another SubscribableListener.
"However we can't rely on that being true in future"

I contemplated making it not be true by adding a check that these listeners are not completed multiple times:
diff --git a/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java b/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java
index 9eddbb55b776..b7c8c4a28279 100644
--- a/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java
+++ b/server/src/main/java/org/elasticsearch/action/support/SubscribableListener.java
@@ -135,7 +135,7 @@ public class SubscribableListener<T> implements ActionListener<T> {
*/
public static <T> SubscribableListener<T> newForked(CheckedConsumer<ActionListener<T>, ? extends Exception> fork) {
final var listener = new SubscribableListener<T>();
- ActionListener.run(listener, fork::accept);
+ ActionListener.run(ActionListener.assertOnce(listener), fork::accept);
return listener;
}
However, on reflection this seems unnecessarily strict, and indeed it causes the SubscribableListener test suite to fail because we actually already assert that newForked and andThen receive the returned listener instance, which is therefore safe to complete more than once. I think it's best to document this fact, see #133391, and then we can rely on it here too (so disregard my previous message).
    safeAwait(publishingBanRemovedListener);
    logger.info("Master publishing ban removed");
    // Assert the master was re-elected
    assertTrue(masterNodeName.equals(internalCluster().getMasterName()) && originalTerm < getTerm(masterNodeName));
nit: slight preference for this being two assertThat calls: if this fails, we get no feedback about why, whereas if we use assertThat twice then we'll be able to see whether it was the master that changed (and it'll identify the new master) or whether the term didn't increase.
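Something like the following, using the Hamcrest equalTo and greaterThan matchers:

    assertThat(internalCluster().getMasterName(), equalTo(masterNodeName));
    assertThat(getTerm(masterNodeName), greaterThan(originalTerm));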
    // Another node was elected, and doesn't have the node in it
    if (clusterService.state().nodes().getMasterNode() != null
        && clusterService.state().nodes().nodeExists(joinRequest.getSourceNode().getId()) == false) {
        // Remove this listener to avoid memory leaks
        clusterService.removeListener(clusterStateListener);
        ll.onFailure(e);
    }
It seems like it'd be simpler to call clusterStateListener.clusterChanged directly here, constructing a "fake" ClusterChangedEvent to carry the current cluster state. I think we want the same logic both ways: particularly, if the node is for some reason in the current cluster state at this point then we can complete ll successfully too.
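i.e. something along these lines (the source string is just an illustrative label):

    ClusterState currentState = clusterService.state();
    clusterStateListener.clusterChanged(new ClusterChangedEvent("immediate-join-check", currentState, currentState));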
     * Node N should join the cluster, but it should not be disconnected (ES-11449)
     */
    @TestLogging(reason = "test includes assertions about logging", value = "org.elasticsearch.cluster.coordination:INFO")
    public void testNodeTriesToJoinClusterAndThenSameMasterIsElected() {
Btw this test is a really good discriminator of the improvement: it failed for me 9 out of 10 times I ran it having reverted the behaviour change in Coordinator, and passed 10 out of 10 with the change in place. Great stuff.
    if (discoveryNodes.nodeExists(joinRequest.getSourceNode().getId())) {
        // Remove this listener to avoid memory leaks
        clusterService.removeListener(this);
Why have this within the if branches? It's on both branches so we call it either way, and I don't see a need to call it after calling nodeExists or anything like that?
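For example (a sketch; the completion values are illustrative - the point is only that removeListener runs on both paths):

    clusterService.removeListener(this);
    if (discoveryNodes.nodeExists(joinRequest.getSourceNode().getId())) {
        ll.onResponse(null);
    } else {
        ll.onFailure(e);
    }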
LGTM
This reverts commit 656a7a9.
Unmutes `DataTierAllocationDeciderIT.testDesiredNodesAreTakenIntoAccountInAutoExpandReplicas` and `DataTierAllocationDeciderIT.testShardsAreKeptInPreferredTierUntilTheNextTierIsInItsFinalState` since the problematic commit #132023 has been reverted
Extends the Coordinator so that we don't prematurely close the connection to a joining node. This prevents a `node-join: [{}] with reason [{}]; for troubleshooting guidance, see {}` WARN log being emitted unnecessarily.
Closes #126192
Jira Ticket - ES-11449